Using place name data to train language identification models

Authors

  • Stanley F. Chen
  • Benoît Maison
Abstract

The language of origin of a name affects its pronunciation, so language identification is an important technology for speech synthesis and recognition. Previous work on this task has typically used training sets that are proprietary or limited in coverage. In this work, we investigate the use of a publicly available geographic database for training language ID models. We automatically cluster place names by language, and show that models trained on place name data are effective for language ID on person names. In addition, we compare several source-channel and direct models for language ID, and achieve a 24% reduction in error rate over a source-channel letter trigram model on a 26-way language ID task.
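
The baseline named in the abstract, a source-channel letter trigram model, can be illustrated with a minimal sketch. It is not the authors' implementation: the add-alpha smoothing, the uniform priors, the two-language toy data, and all names (`LetterTrigramModel`, `train_models`, `identify`) are assumptions made for illustration. Each language gets its own letter trigram model, and a name is assigned to the language L that maximizes P(L) * P(name | L).

```python
import math
from collections import defaultdict


class LetterTrigramModel:
    """Letter trigram model with add-alpha smoothing (smoothing choice is an assumption)."""

    def __init__(self, vocab_size, alpha=0.1):
        self.vocab_size = vocab_size  # number of symbols that can be predicted
        self.alpha = alpha
        self.trigrams = defaultdict(int)  # counts of 3-letter sequences
        self.contexts = defaultdict(int)  # counts of the 2-letter histories

    @staticmethod
    def _pad(name):
        # "^^" marks the start of the name, "$" marks the end.
        return "^^" + name.lower() + "$"

    def train(self, names):
        for name in names:
            s = self._pad(name)
            for i in range(2, len(s)):
                self.trigrams[s[i - 2:i + 1]] += 1
                self.contexts[s[i - 2:i]] += 1

    def log_prob(self, name):
        """Return log P(name | language) under the smoothed trigram model."""
        s = self._pad(name)
        total = 0.0
        for i in range(2, len(s)):
            num = self.trigrams.get(s[i - 2:i + 1], 0) + self.alpha
            den = self.contexts.get(s[i - 2:i], 0) + self.alpha * self.vocab_size
            total += math.log(num / den)
        return total


def train_models(names_by_lang, alpha=0.1):
    """Train one trigram model per language over a shared alphabet."""
    alphabet = {ch for names in names_by_lang.values()
                for name in names for ch in name.lower()}
    vocab_size = len(alphabet) + 2  # + end marker + one slot for unseen letters
    models = {}
    for lang, names in names_by_lang.items():
        model = LetterTrigramModel(vocab_size, alpha)
        model.train(names)
        models[lang] = model
    return models


def identify(name, models, priors):
    """Source-channel decision: argmax over L of log P(L) + log P(name | L)."""
    return max(models, key=lambda lang: math.log(priors[lang]) + models[lang].log_prob(name))


# Toy place-name lists standing in for a geographic database (hypothetical data).
PLACE_NAMES = {
    "de": ["Hamburg", "Nürnberg", "Heidelberg", "Salzburg", "Freiburg"],
    "it": ["Milano", "Torino", "Napoli", "Verona", "Bologna"],
}
MODELS = train_models(PLACE_NAMES)
PRIORS = {"de": 0.5, "it": 0.5}  # uniform priors, an assumption

if __name__ == "__main__":
    for query in ["Augsburg", "Portofino"]:
        print(query, identify(query, MODELS, PRIORS))
```

With realistic amounts of training data the priors would typically be estimated from the data as well, and a stronger smoothing method would likely replace add-alpha; the sketch only shows the shape of the decision rule.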

Similar resources

Language Identification of Bengali-English Code-Mixed data using Character&Phonetic based LSTM Models

Language identification of social media text still remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for language identification at the word level of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, namely character based an...

Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data

This paper describes an accurate, extensible method for automatically classifying unknown foreign words that requires minimal monolingual resources and no bilingual training data (which is often difficult to obtain for an arbitrary language pair). We use a small set of phonologically-based transliteration rules to generate a potentially unlimited amount of pseudo-data that can be used to train ...

Phonetic Landmark Detection for Automatic Language Identification

This paper presents a method of augmenting shifted-delta cepstral coefficients (SDCCs) with the classification outputs of an array of support vector machines (SVMs) trained to detect a set of manner and place features on telephone speech. The SVM array allows for broad phoneme classification, and when this information is concatenated with SDCCs to form a hybrid feature vector for each acoustic ...

Open-Set Language Identification

We present the first open-set language identification experiments using one-class classification models. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus f...

G2P Conversion of Proper Names Using Word Origin Information

Motivated by the fact that the pronunciation of a name may be influenced by its language of origin, we present methods to improve pronunciation prediction of proper names using word origin information. We train grapheme-to-phoneme (G2P) models on language-specific data sets and interpolate the outputs. We perform experiments on US surnames, a data set where word origin variation occurs naturall...

Publication date: 2003